## [1] 113937 81
The Prosper loan dataset contains 81 variables and almost 114,000 observations. However, not all of the variables are worth exploring. In the following analysis, I focus on loan amount, loan origination date, estimated return, current loan status, Prosper rating, Prosper score, listing category, borrower APR, borrower income, borrower employment status, borrower occupation, borrower homeownership, borrower credit history, borrower state, and borrower total Prosper loans.
## 'data.frame': 113937 obs. of 21 variables:
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric.: int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## LoanOriginalAmount LoanOriginationDate EstimatedReturn
## Min. : 1000 2014-01-22 00:00:00: 491 Min. :-0.183
## 1st Qu.: 4000 2013-11-13 00:00:00: 490 1st Qu.: 0.074
## Median : 6500 2014-02-19 00:00:00: 439 Median : 0.092
## Mean : 8337 2013-10-16 00:00:00: 434 Mean : 0.096
## 3rd Qu.:12000 2014-01-28 00:00:00: 339 3rd Qu.: 0.117
## Max. :35000 2013-09-24 00:00:00: 316 Max. : 0.284
## (Other) :111428 NA's :29084
## LoanStatus ProsperRating..Alpha. ProsperScore
## Current :56576 :29084 Min. : 1.00
## Completed :38074 C :18345 1st Qu.: 4.00
## Chargedoff :11992 B :15581 Median : 6.00
## Defaulted : 5018 A :14551 Mean : 5.95
## Past Due (1-15 days) : 806 D :14274 3rd Qu.: 8.00
## Past Due (31-60 days): 363 E : 9795 Max. :11.00
## (Other) : 1108 (Other):12307 NA's :29084
## ListingCategory..numeric. BorrowerAPR IncomeRange
## Min. : 0.000 Min. :0.00653 $25,000-49,999:32192
## 1st Qu.: 1.000 1st Qu.:0.15629 $50,000-74,999:31050
## Median : 1.000 Median :0.20976 $100,000+ :17337
## Mean : 2.774 Mean :0.21883 $75,000-99,999:16916
## 3rd Qu.: 3.000 3rd Qu.:0.28381 Not displayed : 7741
## Max. :20.000 Max. :0.51229 $1-24,999 : 7274
## NA's :25 (Other) : 1427
## StatedMonthlyIncome DebtToIncomeRatio EmploymentStatus
## Min. : 0 Min. : 0.000 Employed :67322
## 1st Qu.: 3200 1st Qu.: 0.140 Full-time :26355
## Median : 4667 Median : 0.220 Self-employed: 6134
## Mean : 5608 Mean : 0.276 Not available: 5347
## 3rd Qu.: 6825 3rd Qu.: 0.320 Other : 3806
## Max. :1750003 Max. :10.010 : 2255
## NA's :8554 (Other) : 2718
## EmploymentStatusDuration Occupation
## Min. : 0.00 Other :28617
## 1st Qu.: 26.00 Professional :13628
## Median : 67.00 Computer Programmer : 4478
## Mean : 96.07 Executive : 4311
## 3rd Qu.:137.00 Teacher : 3759
## Max. :755.00 Administrative Assistant: 3688
## NA's :7625 (Other) :55456
## IsBorrowerHomeowner CreditScoreRangeLower CreditScoreRangeUpper
## False:56459 Min. : 0.0 Min. : 19.0
## True :57478 1st Qu.:660.0 1st Qu.:679.0
## Median :680.0 Median :699.0
## Mean :685.6 Mean :704.6
## 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :880.0 Max. :899.0
## NA's :591 NA's :591
## MonthlyLoanPayment BorrowerState TotalProsperLoans LenderYield
## Min. : 0.0 CA :14717 Min. :0.00 Min. :-0.0100
## 1st Qu.: 131.6 TX : 6842 1st Qu.:1.00 1st Qu.: 0.1242
## Median : 217.7 NY : 6729 Median :1.00 Median : 0.1730
## Mean : 272.5 FL : 6720 Mean :1.42 Mean : 0.1827
## 3rd Qu.: 371.6 IL : 5921 3rd Qu.:2.00 3rd Qu.: 0.2400
## Max. :2251.5 : 5515 Max. :8.00 Max. : 0.4925
## (Other):67493 NA's :91852
Select the variable pairs whose absolute correlation coefficient is higher than 0.4 for further analysis.
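The selection rule above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `toy` holds made-up values, the helper name `strong_pairs` is hypothetical, and missing values are handled pairwise, which may differ from what the original analysis did.

```r
# Made-up stand-in data; the real analysis would use the loan data frame.
set.seed(1)
n <- 200
amount <- runif(n, 1000, 35000)
toy <- data.frame(
  LoanOriginalAmount = amount,
  MonthlyLoanPayment = amount / 36 + rnorm(n, sd = 20),
  BorrowerAPR        = runif(n, 0.05, 0.35)
)
toy$BorrowerAPR[1:5] <- NA  # mimic the missing values present in the real data

# Hypothetical helper: list each pair of numeric columns whose absolute
# Pearson correlation exceeds the threshold (0.4 in the text).
strong_pairs <- function(df, threshold = 0.4) {
  num <- df[sapply(df, is.numeric)]
  cm  <- cor(num, use = "pairwise.complete.obs")
  # Look only at the upper triangle so each pair is reported once.
  idx <- which(upper.tri(cm) & abs(cm) > threshold, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

strong_pairs(toy)
```

Using `pairwise.complete.obs` keeps as many rows as possible for each pair, at the cost of each coefficient being computed on a slightly different subset of rows.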
The analysis above shows that the key variables are borrower APR and loan original amount: the other variables are related to one or both of them.
Next, I analyze how the other factors relate to these two.
To simplify the analysis, I regroup some of the factors based on practical experience:
For loan status, there are four groups: "In process" contains current loans and loans with the final payment in process; "Past Due" contains all past-due categories; "Completed"; and "Bad debt" contains defaulted and charged-off loans. However, since a completed loan could previously have been in any of the other groups, I analyze only the other three.
For loan origination year, I split the data into pre-2009 and post-2009 because of the financial crisis.
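The regrouping described above might look like the sketch below. Several details are assumptions rather than facts from the source: the exact level string `"FinalPaymentInProgress"`, the helper name `group_status`, the choice to count 2009 itself as "post 2009", and the toy rows. The column names `status` and `year` follow the names printed later in this report.

```r
# Made-up rows standing in for the loan data.
loans <- data.frame(
  LoanStatus = c("Current", "FinalPaymentInProgress", "Past Due (1-15 days)",
                 "Completed", "Defaulted", "Chargedoff"),
  LoanOriginationDate = c("2013-11-13 00:00:00", "2014-01-22 00:00:00",
                          "2013-09-24 00:00:00", "2007-03-05 00:00:00",
                          "2008-06-17 00:00:00", "2006-01-10 00:00:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse the detailed statuses into four groups.
# Statuses not named in the text (for example "Cancelled") become NA.
group_status <- function(x) {
  x   <- as.character(x)
  out <- rep(NA_character_, length(x))
  out[x %in% c("Current", "FinalPaymentInProgress")] <- "In process"
  out[grepl("^Past Due", x)]                         <- "Past Due"
  out[x %in% "Completed"]                            <- "Completed"
  out[x %in% c("Defaulted", "Chargedoff")]           <- "Bad debt"
  factor(out, levels = c("In process", "Past Due", "Completed", "Bad debt"))
}

loans$status <- group_status(loans$LoanStatus)

# The date strings start with the four-digit year, so take it directly.
origin_year <- as.integer(substr(as.character(loans$LoanOriginationDate), 1, 4))
# Assumption: 2009 itself is placed in the "post 2009" group.
loans$year <- factor(ifelse(origin_year < 2009, "pre 2009", "post 2009"),
                     levels = c("pre 2009", "post 2009"))

table(loans$status, loans$year)
```

To leave out completed loans as described, one could then keep only the rows where `status` is not `"Completed"`.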
Now compare loan original amount pairwise against the other factors.
All of the factors above appear to affect LoanOriginalAmount.
Because the AA-rated loans peak at 15000, I exclude them and repeat the analysis.
Since the relationship between BorrowerAPR and Prosper rating has already been analyzed, I now focus on the other two factors.
## [1] "BorrowerAPR" "status" "year"
All of the factors above appear to affect BorrowerAPR.
# Bivariate Analysis
## [1] -2286.053
Loans that are current or have the final payment in process fall in the lower region; past-due and bad-debt loans fall in the upper region.
## [1] -3404.868
The monthly payment is determined by the loan original amount, the interest rate, and the payment duration. The interest rate is influenced by the market, so the plot shows three regions, which indicate three different market rates. Within each region, loans with a better status have a lower slope between loan original amount and monthly payment, which indicates a longer duration for loans with a better status.
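The link between slope and duration can be illustrated with the standard fixed-rate amortization formula, payment = P * r / (1 - (1 + r)^(-n)), where P is the principal, r the monthly rate, and n the number of monthly payments. The sketch below assumes the loans amortize this way, which the data alone does not confirm; the function name `monthly_payment`, the 15% rate, and the 36 and 60 month terms are illustrative choices only.

```r
# Hypothetical helper: level monthly payment of a fixed-rate amortizing loan.
monthly_payment <- function(principal, annual_rate, n_months) {
  r <- annual_rate / 12                  # monthly interest rate
  if (r == 0) return(principal / n_months)
  principal * r / (1 - (1 + r)^(-n_months))
}

# The payment is proportional to the principal, so the payment per unit of
# principal is the slope of monthly payment against loan original amount.
slope_36 <- monthly_payment(1, 0.15, 36)
slope_60 <- monthly_payment(1, 0.15, 60)

# At the same rate, the longer term gives the smaller slope.
c(slope_36 = slope_36, slope_60 = slope_60)
```

Under this formula a lower slope at a fixed rate corresponds to a longer term, which matches the reading of the plot given above.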
## [1] -331.8527
## [1] -9446.689
## [1] -198340.4